Abstract: This paper introduces a modeling framework
for distributed regression with agents/experts observing
attribute-distributed data (heterogeneous data).
Under this model, a new algorithm, the iterative covariance
optimization algorithm (ICOA), is designed to reshape
the covariance matrix of the training residuals of
individual agents so that the linear combination of the
individual estimators minimizes the ensemble training
error. Moreover, a scheme (Minimax Protection) is designed
to provide a trade-off between the number of data
instances transmitted among the agents and the performance
of the ensemble estimator without undermining
the convergence of the algorithm. This scheme also provides
an upper bound (with high probability) on the test
error of the ensemble estimator. The efficacy of ICOA
combined with Minimax Protection and the comparison
between the upper bound and actual performance are
both demonstrated by simulations.

Keywords: Distributed learning, heterogeneous data, 
cooperative training 

1 Introduction 

Distributed learning is a field that generalizes classi- 
cal machine learning algorithms to a distributed frame- 
work. Unlike the classical learning framework, where 
one has full access to all the data and has unlimited 
central computation capability, in the framework of 
distributed learning, the data are distributed among a 
number of agents, who have limited access to the data. 
These agents are capable of exchanging certain types 
of information, which, due to limited computational 
power and communication restrictions (limited band- 
width or limited power), is usually restricted in terms 
of content and amount. Research in distributed learning
seeks effective learning algorithms and theoretical
limits within such constraints on computation, communication,
and confidentiality.

Distributed learning can be categorized into many 
subareas. In terms of the way that the data are dis- 
tributed, it can be categorized into homogeneous data 
(instance/horizontally distributed data) and heterogeneous
data (attribute/vertically distributed data). In
terms of the structure of the entire system, it can be cat- 
egorized into systems with a fusion center and systems 
without a fusion center. Each category has its unique 
applications and challenges. We focus, in this paper, on 
the case in which data are attribute-distributed. The 
algorithm that we develop can be adapted to systems 
either with or without a fusion center. 

The homogeneous-data problems have been widely 
studied. Two important types of models are established 
in [T] and [5] respectively: instance distributed learn- 
ing with and without a fusion center. The relationship 
between the information transmitted among individual 
agents and the fusion center and the ensemble learn- 
ing capability are discussed in these papers. Classical 
learning algorithms are more easily adapted to the ho- 
mogeneous cases because for each agent, the form of the 
classifier/estimator is exactly the same as that of the 
centralized learning algorithm. The homogeneity in the 
individual classifiers/estimators is a great advantage for 
designing distributed learning algorithms that compare 
and combine them. 

However, these advantages disappear in the heteroge- 
neous data case, where different agents observe different 
attributes, and thus have many different forms of clas- 
sifiers/estimators. This makes it harder to evaluate, 
compare and combine the estimators. Nevertheless, 
there are some research results in this area (e.g., [I], 
[B]). Some basic ideas include voting/averaging, meta- 
learning, collective data mining, and residual refitting. 
The voting/averaging algorithm simply combines (lin- 
early) the predictions of the individual agents. The 
training process is purely non-cooperative. In the meta- 
learning case (see [3] and [3]), the fusion center seeks a 



more sophisticated way to integrate predictions of indi- 
vidual estimators by taking their predictions as a new 
training set (learning of learning results), i.e., the fusion 
center treats the output of individual estimators as the 
input covariates. Although this hierarchical training 
scheme looks more delicate, it is still non-cooperative 
and hence fails to learn hidden rules in which covariates 
of different agents intertwine in a complicated way. 

In contrast, the collective data mining algorithms 
(see [5], [Z] and [H]) are cooperative. They seek the 
information required to be shared among the agents so 
that the optimal estimator can be decomposed into an 
additive form without compromising the performance 
of the ensemble estimator (compared to the estimator 
trained by the centralized algorithm). Yet this requirement
is rather strong, and hence this technique relies on
specific types of transformations, which require much 
prior knowledge of the problem, and thus is hard to 
generalize to other problems. The residual refitting 
algorithm, another cooperative training algorithm de- 
scribed in [3], has the advantage of not being depen- 
dent on individual learning algorithms. The only way 
that the agents communicate with each other is through 
their residuals. However, these algorithms are based on 
an additive model, and are susceptible to overtraining 
and pitfalls of local optima, though under some assump- 
tions, optimality can be guaranteed. 

In this paper, we develop another cooperative train- 
ing scheme, using a modeling framework similar to that 
of the residual-refitting algorithms. However, instead 
of refitting the residuals directly, our new algorithm 
seeks to reshape the covariance matrix of the residuals 
generated by all the agents so that the linear combi- 
nation of the estimators maintained by the agents can 
achieve a low ensemble test error. Again, residuals are 
the only information that the agents communicate to 
each other, yet they are used more intelligently than in 
the case of residual-refitting. In the cases where
residual-refitting is guaranteed to achieve global optimality, our
new algorithm, the iterative covariance optimization algorithm
(ICOA), achieves similar results, and owing to
its insusceptibility to overtraining, ICOA usually
outperforms residual refitting.

Another important issue for a distributed learning 
system is the trade-off between the amount of informa- 
tion exchanged among the agents and the performance 
of the ensemble estimator/classifier. To study this re- 
lationship, the major challenge is how to quantify the 
information exchanged and the ensemble performance. 
In this paper, based on ICOA with Minimax Protec- 
tion, the relationship between the amount of informa- 
tion exchanged (measured by the compression rate) and 
the optimal test error of the ensemble estimator is dis- 
cussed. More interestingly, an upper bound on the test 
error of the ensemble estimator is also derived. 

The rest of this paper is organized as follows. In Section 2,
we describe the basic model and abstract the



problem of finding an optimal additive ensemble esti- 
mator into a two-stage optimization. In Section 3, we 
analyze this optimization problem and introduce ICOA, 
with its efficacy demonstrated by simulation. In Sec- 
tion 4, we discuss the problem of how to keep ICOA 
functioning when the covariance is not accurately es- 
timated, which leads to Minimax Protection, and we 
demonstrate the trade-off between data transmission 
and system performance with an upper bound on the 
test error with respect to the data compression rate. 
Section 5 contains our conclusions. 

2 Model and problem 

Our discussion is based on an estimation/regression 
problem with attribute-distributed data. The estimation
problem is specified as follows:

There are $M$ covariates (or attributes) $X_1, \ldots, X_M$
and one outcome $Y$, so the entire data set of $N$ instances
is comprised of

$$\{(x_{i1}, x_{i2}, \ldots, x_{iM}, y_i)\}_{i=1}^{N},$$

where $N$ is the number of instances, $x_{ij} \in \mathbb{R}$ is the $i$-th
instance of $X_j$, and $y_i \in \mathbb{R}$ is the $i$-th instance of $Y$.

We assume that there exists a hidden deterministic
function (rule/hypothesis) $\phi : \mathbb{R}^M \to \mathbb{R}$ such that

$$y_i = \phi(x_{i1}, x_{i2}, \ldots, x_{iM}) + w_i, \tag{1}$$

where $\{w_i\}_{i=1}^{N}$ is an independently drawn sample from
a zero-mean random variable $W$ that is independent of
$X_1, \ldots, X_M$.

Suppose there are $D$ agents, each of which has only
limited access to certain attributes. Define $F_j$ ($j = 1, \ldots, D$)
to be the set of attributes accessible by agent
$j$, and define $F = \bigcup_{j=1}^{D} F_j$, assuming that $|F| = M$.
The outcome Y, with all its instances, is visible to all 
the agents. These assumptions specify the "attribute- 
distributed" properties of our problem. 

To highlight the distributed nature of the system, 
we add an extra restriction: the only information that 
the agents can communicate with each other is their 
training residuals (or information that can be locally 
derived from the training residuals). This is a reason- 
able assumption considering that the data observable 
by one agent are usually confidential or incompatible 
with the learning algorithm run by another agent. 

Therefore, for these $D$ agents, each agent $i$ maintains
an estimator $f_i$ of the outcome, which is a function
that takes the covariates $X_{F_i}$ as its input. With the individual
estimators $f_i$ fixed, the problem of finding an optimal
ensemble estimator of additive form can be described
as the optimization problem

$$\min_{\mathbf{a}} \; E\Big[\Big(Y - \sum_{i=1}^{D} a_i f_i(X_{F_i})\Big)^2\Big], \tag{2}$$

where the $a_i$ are the weighting coefficients. Moreover, if we
assume that each estimator has no "bias" after training,
or equivalently that the residuals have zero mean,
then it follows that $E[f_i(X_{F_i})] = E[Y]$.
For the ensemble estimator to be unbiased as well, the
weighting coefficients must sum to 1, i.e., $\sum_{i=1}^{D} a_i = 1$.

Consequently, we can rewrite the objective function as

$$E\Big[\Big(\sum_{i=1}^{D} a_i \, [Y - f_i(X_{F_i})]\Big)^2\Big]. \tag{3}$$

Note that the bracketed factor in the $i$th term is the residual
of the $i$th agent, defined as $R_i = Y - f_i(X_{F_i})$. Therefore,
the objective function can be rewritten as

$$E\Big[\Big(\sum_{i=1}^{D} a_i R_i\Big)^2\Big]. \tag{4}$$

To simplify our derivation, define the covariance matrix
of the residuals as $A$, where $[A]_{ij} = \mathrm{cov}(R_i, R_j)$;
then the problem can be further simplified into a more
concise form:

$$\min_{\mathbf{a}} \; \mathbf{a}^T A \mathbf{a} \tag{5}$$

$$\text{s.t.} \quad \mathbf{1}^T \mathbf{a} = 1, \tag{6}$$

where $\mathbf{a} = [a_1 \cdots a_D]^T$. This is our starting
point for the distributed regression problem. The
optimization problem of finding the best ensemble esti- 
mator of the form of a linear combination of individual 
estimators is equivalent to finding the best individual 
estimators that generate the most desirable residuals 
that cancel each other out. 

The problem described by (5) and (6) is readily
solved if the covariance matrix A is known and fixed. 
This is equivalent to the problem of finding the best 
linear combination of the estimators with these esti- 
mators given and fixed - a case that appears in the 
non-cooperative training algorithms. However, in a co- 
operative training algorithm, by communicating with 
each other, the agents have a chance to change their 
training residuals intelligently and repeatedly so as to 
reshape the covariance matrix A and to minimize the 
ensemble error. It is this step that makes the problem 
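interesting and difficult. Therefore, the entire problem
can be summarized as a two-stage optimization problem:

$$\min_{A} \min_{\mathbf{a}} \; \mathbf{a}^T A \mathbf{a} \tag{7}$$

$$\text{s.t.} \quad \mathbf{1}^T \mathbf{a} = 1, \tag{8}$$

$$A \text{ subject to training restrictions.} \tag{9}$$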



Note that $A$ is subject to training restrictions because
the residual generated by each agent is not arbitrary.
Each $R_i$ must be achievable in the form $Y - f_i(X_{F_i})$,
which is highly restrictive because of the space to which
$f_i$ belongs.



3 Solution to the two-stage optimization

The first (inner) step of the optimization has a closed
form solution (solved by Lagrange multipliers). When

$$\mathbf{a} = \frac{A^{-1}\mathbf{1}}{\mathbf{1}^T A^{-1} \mathbf{1}}, \tag{10}$$

the minimum value $\eta$ is achieved:

$$\eta = \frac{1}{\mathbf{1}^T A^{-1} \mathbf{1}}, \tag{11}$$

i.e., the minimum value is the inverse of the sum of all
the elements of the inverse of $A$. Moreover, since $A$ is
a covariance matrix, which we take to be positive definite
so that $\eta > 0$, the second stage optimization problem is
equivalent to

$$\max_{A} \; \mathbf{1}^T A^{-1} \mathbf{1}. \tag{12}$$



It is necessary to bear in mind that A is subject to 
training restrictions. 
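As a small numerical illustration of the closed-form inner step, the sketch below computes the weights of (10) and the minimum value of (11) for a hypothetical 2 x 2 covariance matrix. The helper names (`solve`, `optimal_weights`, `quad`) and the matrix entries are assumptions made for this example only; they do not come from the paper.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def optimal_weights(A):
    """Return (a, eta) as in (10) and (11); A is assumed positive definite."""
    u = solve(A, [1.0] * len(A))   # u = A^{-1} 1
    s = sum(u)                     # s = 1^T A^{-1} 1
    return [ui / s for ui in u], 1.0 / s


def quad(A, a):
    """Ensemble error a^T A a for a given weight vector."""
    n = len(a)
    return sum(a[i] * A[i][j] * a[j] for i in range(n) for j in range(n))


# Hypothetical residual covariance of two agents (illustrative values only).
A = [[1.0, 0.2],
     [0.2, 2.0]]
a, eta = optimal_weights(A)
print(a, eta, quad(A, [0.5, 0.5]))
```

The optimal weights sum to one, attain the value `eta`, and do better than plain averaging with weights (0.5, 0.5) on the same matrix.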



The optimization problem described in (12) is the
key step of our algorithm. The most difficult step is 
to quantify the "training constraints" of the covariance 
matrix of the residuals. To tackle this problem, it is 
necessary to examine the inner structure of A. 

As previously assumed, we have $D$ agents, and each
agent $i$ maintains an estimator specified by the function
$f_i(X_{F_i})$. Then the covariance matrix $A$ can
be expressed as

$$[A]_{ij} = E\big[(Y - f_i(X_{F_i}))\,(Y - f_j(X_{F_j}))\big], \tag{13}$$



where $[A]_{ij}$ stands for the element of $A$ in the $i$th row
and $j$th column. However, for numerical purposes, we
need to write the matrix in the form of a statistic by
describing everything in terms of actual data. So we
characterize the function $f_i$ by a vector $\mathbf{f}_i$, which is the
prediction of the function $f_i$ on all the training data
points of agent $i$. Similarly, we define $\mathbf{y}$ as the vector of values
of the outcome $Y$ for all data instances. Then, the
covariance matrix $A$ can be estimated (assuming that
all the estimators are unbiased) by

$$[A]_{ij} = \frac{1}{N} (\mathbf{y} - \mathbf{f}_i)^T (\mathbf{y} - \mathbf{f}_j). \tag{14}$$
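The estimate in (14) only needs the outcome vector and each agent's prediction vector. The following is a minimal sketch; the function name `residual_covariance` and the toy numbers are hypothetical and chosen so that the arithmetic is easy to follow.

```python
def residual_covariance(y, preds):
    """Estimate [A]_ij = (1/N) (y - f_i)^T (y - f_j), as in (14).

    y     : list of N outcome values
    preds : list of D prediction vectors f_i, each of length N
    """
    n = len(y)
    res = [[yk - fk for yk, fk in zip(y, f)] for f in preds]
    return [[sum(ri[k] * rj[k] for k in range(n)) / n for rj in res]
            for ri in res]


# Toy data (illustrative only): two agents with zero-mean residuals.
y = [1.0, 2.0, 3.0, 4.0]
preds = [[1.5, 1.5, 3.5, 3.5],   # hypothetical predictions of agent 1
         [0.0, 3.0, 2.0, 5.0]]   # hypothetical predictions of agent 2
A = residual_covariance(y, preds)
print(A)
```

In this toy case the two residual vectors are exactly anti-correlated, so the weights (2/3, 1/3) cancel them completely. This is an extreme instance of the "residuals that cancel each other out" that the two-stage problem looks for, and here the estimated matrix is singular.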



With this notation, the optimization problem can
now be converted into a more specific and implementable
one:

$$\max_{\mathbf{f}_1, \ldots, \mathbf{f}_D} \; \mathbf{1}^T A^{-1} \mathbf{1} \tag{15}$$

$$\text{s.t.} \quad \mathbf{f}_i \in H_i, \quad i = 1, \ldots, D, \tag{16}$$

where $H_i$ denotes the space to which $\mathbf{f}_i$ belongs, which
depends on the class of functions in which $f_i$ resides.
Thus, the constraints on $A$ are implicitly included in
the constraints on the vectors $\mathbf{f}_1, \ldots, \mathbf{f}_D$.



The next step, ordinarily, is to massage the optimiza- 
tion problem and to prove the convexity of the objective 
function and the domain so that we can apply gradient
descent algorithms and guarantee global optimality.
Yet for our problem, since the objective and constraints
are both rather intricate and to some extent
only implicitly specified, it is not very feasible to prove 
convexity without additional assumptions. Therefore, 
we will directly develop an algorithm based on gradient 
descent and test the algorithm empirically before we 
delve deeply into the problem of global optimality. 

3.1 Iterative covariance optimization 
algorithm 

The first thing required for a gradient-based algorithm
is to find an expression for the gradient of the objective
$\mathbf{1}^T A^{-1} \mathbf{1}$ in (15) with respect to $\mathbf{f}_i$. By rather lengthy and
intricate computation, a closed form expression for the
gradient is given by



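[Equation garbled in the source. Its legible terms are
$(\mathbf{1}^T A^{*} \mathbf{1}) \sum_{j=1}^{D} (\mathbf{y} - \mathbf{f}_j)[A^{*}]_{ij}$ and
$\sum_{j} (\mathbf{f}_k - \mathbf{f}_j)[B^{*}(k)]_{ij}$; the remaining factors could not be recovered.]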



where $k \neq i$, $A^{*}$ denotes the adjoint of $A$, and $B(k)$ is
a $(D-1) \times (D-1)$ matrix given by

$$[B(k)]_{ij} = (\mathbf{f}_k - \mathbf{f}_{i + c_{ik}})^T (\mathbf{f}_k - \mathbf{f}_{j + c_{jk}}),$$

where $c_{ik} = 0$ if $i < k$, and $c_{ik} = 1$ if $i \geq k$.

This provides us with a feasible yet complex algo- 
rithm for estimating the gradient. It is worth noting 
that the gradient depends only on the residuals of the 
agents (through easy conversion of fi as y — r,j). In 
practice, we can also use numerical methods to esti- 
mate the gradient, i.e. we can perturb the components 
of fi, compute the change of the objective and use the 
ratio between the change and the perturbation as an 
approximation of that component of the gradient. 
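The perturbation approach can be sketched as a central finite difference on the objective $\mathbf{1}^T A^{-1} \mathbf{1}$, with $A$ estimated as in (14). For comparison, the sketch also evaluates a compact expression, $(2/N)\, u_i \sum_j u_j (\mathbf{y} - \mathbf{f}_j)$ with $\mathbf{u} = A^{-1}\mathbf{1}$, which we derived by differentiating (14) directly. This compact form is an assumption of the example and is not the adjoint-based expression of the paper. All function names and data values are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def objective(y, preds):
    """Return (1^T A^{-1} 1, u = A^{-1} 1, residual vectors), A as in (14)."""
    n = len(y)
    res = [[yk - fk for yk, fk in zip(y, f)] for f in preds]
    A = [[sum(p * q for p, q in zip(ri, rj)) / n for rj in res] for ri in res]
    u = solve(A, [1.0] * len(preds))
    return sum(u), u, res


def numerical_gradient(y, preds, i, h=1e-6):
    """Perturb each component of f_i and take a central difference."""
    grad = []
    for k in range(len(y)):
        up = [list(f) for f in preds]
        dn = [list(f) for f in preds]
        up[i][k] += h
        dn[i][k] -= h
        grad.append((objective(y, up)[0] - objective(y, dn)[0]) / (2 * h))
    return grad


def compact_gradient(y, preds, i):
    """Assumed compact form: (2/N) * u_i * sum_j u_j (y - f_j)."""
    n = len(y)
    _, u, res = objective(y, preds)
    return [2.0 * u[i] / n * sum(u[j] * res[j][k] for j in range(len(preds)))
            for k in range(n)]


# Hypothetical outcome and predictions of three agents (illustrative only).
y = [0.1, 0.4, 0.35, 0.8, 0.9]
preds = [[0.2, 0.3, 0.5, 0.6, 0.9],
         [0.0, 0.5, 0.3, 0.9, 0.7],
         [0.3, 0.3, 0.3, 0.7, 0.8]]
for i in range(3):
    print(numerical_gradient(y, preds, i))
    print(compact_gradient(y, preds, i))
```

The two gradients agree up to finite-difference error, which also confirms that the gradient depends on the data only through the residual vectors.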

There is another important issue before we develop
the algorithm: we can search, by gradient descent, for
a desirable vector $\mathbf{f}_i'$ to replace $\mathbf{f}_i$ so that we can increase the
value of the objective function, yet $\mathbf{f}_i'$ might not be
achievable, because agent $i$ may not be able to find a
new estimator $f_i$ whose predictions on the training data
equal $\mathbf{f}_i'$. Therefore, what is reasonable to do is to use $\mathbf{f}_i'$ as the
new outcome for agent $i$ (instead of $\mathbf{y}$) to train and find
a new estimator $f_i$, i.e., we find the best projection of
$\mathbf{f}_i'$ onto the space $H_i$.

Based on the description above, the basic idea of
ICOA is summarized as follows. First, cooperatively,
all the agents determine the present covariance matrix
$A$ of their residuals. Then, one by one, each agent $i$
finds its estimate of the gradient of the objective
$\mathbf{1}^T A^{-1} \mathbf{1}$ with respect to $\mathbf{f}_i$, after which
the selected agent $i$ moves its vector $\mathbf{f}_i$ along this
gradient to obtain an updated target vector $\mathbf{f}_i'$. After that, agent $i$ projects $\mathbf{f}_i'$ onto $H_i$
by training with $\mathbf{f}_i'$ as the outcome and thus obtains the
new version of $\mathbf{f}_i$. Then, after agent $i$ updates its residual,
all the agents update their estimates of the covariance
matrix $A$.

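More precisely, the algorithm is as shown below:

    while $|\eta_n - \eta_{n-1}| > \epsilon$ do
        for $i$ from 1 to $D$ do
            1. Given the current $A$, compute the gradient $\mathbf{g}_i$ of $\mathbf{1}^T A^{-1} \mathbf{1}$ with respect to $\mathbf{f}_i$;
            2. Back-search for the optimal step size $\lambda$;
            3. $\mathbf{f}_i' \leftarrow \mathbf{f}_i + \lambda \mathbf{g}_i$;
            4. Train $f_i(X_{F_i})$ with $\mathbf{f}_i'$ as the outcome;
            5. Use $f_i$ to update the training residual of agent $i$ and update $A$;
        end
        $\eta_{n+1} \leftarrow 1 / (\mathbf{1}^T A^{-1} \mathbf{1})$;
        $n \leftarrow n + 1$;
    end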
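The loop above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the implementation used in the paper. Each agent's learner is a hypothetical one-variable least-squares line (`fit_line`) standing in for the projection onto $H_i$. The gradient uses a compact form that we derived from (14), namely $(2/N)\, u_i \sum_j u_j(\mathbf{y} - \mathbf{f}_j)$ with $\mathbf{u} = A^{-1}\mathbf{1}$. The back-search simply halves a trial step until the objective improves. The synthetic data, the parameter values, and all function names are assumptions of this example.

```python
import random


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_line(x, t):
    """Project target t onto {c0 + c1 * x} by least squares (stand-in learner)."""
    n = len(x)
    mx, mt = sum(x) / n, sum(t) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    c1 = sum((xi - mx) * (ti - mt) for xi, ti in zip(x, t)) / sxx
    return [mt + c1 * (xi - mx) for xi in x]


def objective(y, F):
    """Return (1^T A^{-1} 1, u = A^{-1} 1, residual vectors), A as in (14)."""
    n = len(y)
    R = [[yk - fk for yk, fk in zip(y, f)] for f in F]
    A = [[sum(p * q for p, q in zip(ri, rj)) / n for rj in R] for ri in R]
    u = solve(A, [1.0] * len(F))
    return sum(u), u, R


def icoa(y, X, eps=1e-10, max_rounds=20, step0=64.0):
    """Return (weights, initial eta, final eta); agent i sees only X[i]."""
    n, D = len(y), len(X)
    F = [fit_line(x, y) for x in X]          # non-cooperative initial fits
    g, u, R = objective(y, F)
    eta_init = 1.0 / g
    for _ in range(max_rounds):
        eta_prev = 1.0 / g
        for i in range(D):
            # Step 1: gradient of 1^T A^{-1} 1 w.r.t. f_i (assumed compact form).
            grad = [2.0 * u[i] / n * sum(u[j] * R[j][k] for j in range(D))
                    for k in range(n)]
            step = step0
            while step > 1e-8:               # Step 2: back-search on the step size
                target = [f + step * d for f, d in zip(F[i], grad)]   # Step 3
                cand = F[:i] + [fit_line(X[i], target)] + F[i + 1:]   # Step 4
                g_new, u_new, R_new = objective(y, cand)              # Step 5
                if g_new > g:                # accept only if the objective improves
                    F, g, u, R = cand, g_new, u_new, R_new
                    break
                step *= 0.5
        if abs(eta_prev - 1.0 / g) <= eps:
            break
    return [v / g for v in u], eta_init, 1.0 / g


# Synthetic data (illustrative only): three attributes, one per agent.
rng = random.Random(0)
N = 80
X = [[rng.random() for _ in range(N)] for _ in range(3)]
y = [X[0][k] + 2.0 * X[1][k] - X[2][k] + 0.5 * X[0][k] * X[1][k]
     + rng.gauss(0.0, 0.05) for k in range(N)]
weights, eta_init, eta_final = icoa(y, X)
print(weights, eta_init, eta_final)
```

Because a step is accepted only when it increases $\mathbf{1}^T A^{-1} \mathbf{1}$, the ensemble training error never increases from one update to the next in this sketch. A different learner can be swapped in by replacing `fit_line`.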



3.2 Simulation for regression problems 

In order to compare distributed regression implemented 
by ICOA to other multi-dimensional regression algo- 
rithms (distributed or non-distributed), we use three 
functions used in [10] as the hidden rule to generate
our simulation training data sets. The three functions 
and the corresponding joint distribution of the covari- 
ates are: 

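• Friedman-1:

$$\phi(\mathbf{x}) = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 1/2)^2 + 10 x_4 + 5 x_5 + w,$$

where $x_j \sim U[0,1]$, $j = 1, \ldots, 5$;

• Friedman-2:

$$\phi(\mathbf{x}) = \sqrt{x_1^2 + \Big(x_2 x_3 - \frac{1}{x_2 x_4}\Big)^2} + w,$$

where $x_1 \sim U[1,100]$, $x_2 \sim U[40\pi, 560\pi]$,
$x_3, x_5 \sim U[0,1]$, $x_4 \sim U[1,11]$;

• Friedman-3:

$$\phi(\mathbf{x}) = \tan^{-1}\Big(\frac{x_2 x_3 - \frac{1}{x_2 x_4}}{x_1}\Big) + w,$$

where the distributions of the covariates are the
same as those of Friedman-2.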



All the covariates are independent of one another, and
before running the algorithm, the outcomes are normalized
to the range [0,1]. Also, to highlight the effects of
the distributed nature of the system, the independent
additive white noise $w$ is set to a negligible level in our
simulation. Furthermore, it is worth pointing out that
in Friedman-2 and Friedman-3, attribute $X_5$ is irrelevant,
serving purely as a nuisance variable.

The structure of the entire distributed system is as
follows. There are 5 attributes, $X_1, \ldots, X_5$, and we
assume that there are 5 agents, with agent $i$ observing
attribute $X_i$ exclusively. Each agent uses a regression
tree as its individual estimator.
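A data set of this kind can be generated as follows. This is a minimal sketch for Friedman-1 only; the function name `friedman1`, the seed, and the default sample size are hypothetical choices for the example. The noise level defaults to zero, in line with the negligible noise used in the simulation.

```python
import math
import random


def friedman1(n, noise_sd=0.0, seed=0):
    """Generate n Friedman-1 instances with outcomes normalized to [0, 1].

    Returns (columns, y), where columns[j] holds attribute X_{j+1},
    i.e. the only data that agent j+1 is allowed to observe.
    """
    rng = random.Random(seed)
    rows = [[rng.random() for _ in range(5)] for _ in range(n)]
    y = [10.0 * math.sin(math.pi * x[0] * x[1]) + 20.0 * (x[2] - 0.5) ** 2
         + 10.0 * x[3] + 5.0 * x[4] + rng.gauss(0.0, noise_sd)
         for x in rows]
    lo, hi = min(y), max(y)
    y = [(v - lo) / (hi - lo) for v in y]        # normalize outcomes to [0, 1]
    columns = [[x[j] for x in rows] for j in range(5)]
    return columns, y


columns, y = friedman1(500)
print(len(columns), len(y), min(y), max(y))
```

Splitting the covariates by column, rather than by row, is what makes the data attribute-distributed: every agent sees all instances of the outcome but only its own attribute.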

With the setup above, the simulation results of ICOA
are shown in Table 1. As a comparison,
we ran two other distributed regression algorithms:
averaging and residual refitting (or ICEA, see [5] for
details).


Friedman data set     1        2        3
ICOA                  .0047    .0095    .0086
Residual refit        .0047    .0101    .0096
Averaging             .0277    .0355    .0312

Table 1: Test errors (mean squared) of ICOA, the residual
refitting algorithm and the averaging algorithm on
Friedman-1, -2 and -3.

Generally speaking, the performance, measured in 
terms of test error, of ICOA is slightly better than that 
of the residual refitting algorithm for these three cases, 
while being much better than the averaging algorithm. 
More importantly, ICOA has shown little sign of over- 
training, yet this is not the case for residual refitting. 
This is demonstrated in Figure 1.

Figure 1: Comparison between the convergence
of ICOA and the residual-refitting algorithm for
Friedman-1. ICOA is less susceptible to overtraining.
The training error of ICOA basically parallels the
trend of the test error. Yet for the residual-refitting
algorithm, although the training error converges
rapidly, it does not correctly reflect the trend of the
test error of the ensemble estimator.

Note that for the residual refitting algorithm, the test 
error curve turns up as the rounds of iteration increase, 



even if the training error is consistently decreasing. On 
the contrary, the test error curve and training error 
curve of ICOA are almost parallel and horizontal. This 
suggests that ICOA knows when to stop unnecessary 
overtraining and its training error is a good indicator 
of its test error. This property is highly desirable. 

The insusceptibility of ICOA to overtraining is a re- 
sult of the fact that whenever an agent optimizes its 
estimator, it takes into consideration the predictions 
(represented by the residuals) of all the other estima- 
tors, instead of only one (as in the case of the residual 
refitting algorithm). Compared with residual refitting, 
ICOA is a less "greedy" algorithm in reducing its train- 
ing error, and once it reaches the optimal ensemble es- 
timator, the covariance matrix, known to all agents, 
would prevent the agents from changing their estima- 
tors any further, unlike in the residual refitting cases, 
where the agents are busy fitting the very last noise left 
in the residuals, the culprit for overtraining. 

4 Optimization under inaccurate 
covariance 

Although ICOA has an advantage in the performance 
of its ensemble estimator compared to other distributed 
algorithms, it requires more communication among the 
agents. In the voting/averaging algorithm, no residual
transmission is required. In the residual refitting algorithm,
for each iteration, residuals need to be transmitted
$D$ times in total, or once for each agent (as
soon as an agent finishes training, it sends its residual
to the next agent). However, for ICOA, residuals
need to be transmitted $D(D-1)$ times in total,
or $D-1$ times for each agent (each agent needs to
send its residual to the other agents whenever it
finishes training, because each agent needs all the latest
training residuals to compute the new covariance
matrix). If the total number of data instances is $N$,
then in terms of data transmission, the complexity is
$O(1)$ for voting/averaging, $O(ND)$ for the residual
refitting algorithm, and $O(ND^2)$ for ICOA. The most
communication-intensive algorithm among the three is
ICOA, which is highly undesirable when the number of
agents is large. This is illustrated in Figure 2.
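To make the counts concrete, the short sketch below tallies the number of residual values sent per iteration under the transmission pattern just described. The function name and the example sizes ($N = 4000$, $D = 5$) are assumptions for illustration.

```python
def values_sent_per_iteration(n_instances, n_agents, algorithm):
    """Count residual values transmitted in one iteration over all agents."""
    if algorithm == "averaging":
        return 0                                        # no residuals exchanged
    if algorithm == "residual_refitting":
        return n_instances * n_agents                   # one transmission per agent
    if algorithm == "icoa":
        return n_instances * n_agents * (n_agents - 1)  # D - 1 per agent
    raise ValueError("unknown algorithm: " + algorithm)


for name in ("averaging", "residual_refitting", "icoa"):
    print(name, values_sent_per_iteration(4000, 5, name))
```

With 5 agents ICOA sends four times as many values as residual refitting, and the ratio grows linearly with the number of agents.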

To reduce the amount of data that needs to be transmitted
among the agents for ICOA, we can relax the
accuracy of the estimation of the covariances. Yet this
compromises the first (inner) step of our two-stage optimization
specified by (7). We need to develop an algorithm
that still functions when $A$ is not fixed, but
can take values over a non-singleton domain.

4.1 A minimax problem 

Given the covariance matrix $A$, the standard optimization
problem for ICOA is given by (5) and (6). However,
when we restrict the amount of data that can be
transmitted, the estimation of the covariance matrix is
not accurate enough. We can model this by allowing $A$
to be of any value in a range around its available estimate $A_0$, i.e.,

$$A \in \mathcal{C} \cap \mathcal{P}, \tag{17}$$

where

$$\mathcal{C} = \{ S \mid [S]_{ij} \in [\, [A_0]_{ij} - \delta_{ij},\; [A_0]_{ij} + \delta_{ij} \,] \} \tag{18}$$

and $\mathcal{P}$ stands for the class of positive semidefinite matrices
of the same size as $A$. In other words, the element
$[A]_{ij}$ has a range of length $2\delta_{ij}$ centered at $[A_0]_{ij}$, with
the positive semidefiniteness of $A$ guaranteed.

Figure 2: Illustrations of communication requirements
for voting/averaging, residual refitting, and ICOA
(from left to right). For a system of $D$ agents and $N$
training data instances, in terms of data transmission,
the complexity for the voting/averaging algorithm is $O(1)$,
the complexity for the residual refitting algorithm is
$O(ND)$ and the complexity for ICOA is $O(ND^2)$.

This assumption addresses the inaccuracy of the estimation
of the covariance matrix $A$. Moreover, the
choice of $\mathbf{a}$ should take into account the worst case
of $A$, which could be anywhere in $\mathcal{C}$ (here we neglect
the extra constraint of $\mathcal{P}$, which only makes our adversary
$A$ stronger in the minimization), i.e., we have
a minimax optimization problem for the choice of $\mathbf{a}$:

$$\min_{\mathbf{a}} \max_{A \in \mathcal{C}} \; \mathbf{a}^T A \mathbf{a} \tag{19}$$

$$\text{s.t.} \quad \mathbf{a}^T \mathbf{1} = 1. \tag{20}$$


The solution to the inner maximization is straightforward,
because we can decompose this problem into
$D \times D$ independent optimization problems:

$$\max_{[A]_{ij}} \; a_i a_j [A]_{ij}. \tag{21}$$

Obviously, when $a_i a_j > 0$, $[A]_{ij} = [A_0]_{ij} + \delta_{ij}$; otherwise,
$[A]_{ij} = [A_0]_{ij} - \delta_{ij}$. More concisely,

$$[A]_{ij} = [A_0]_{ij} + \delta_{ij} \, \mathrm{sgn}(a_i a_j). \tag{22}$$



To simplify our next step, we now need to make a
few more assumptions. First, since the diagonal elements
of $A$ are estimated locally, i.e., no data need be
transmitted to estimate the residual variances of individual
agents, it is reasonable to assume $\delta_{ii} = 0$, $i = 1, \ldots, D$.
Another assumption is $\delta_{ij} = \delta > 0$, $i \neq j$; thus we can
characterize the uncertainty of the estimation of the covariances
using a single number. This might sacrifice some
accuracy of our model, yet it simplifies our problem, at
least for a preliminary exploration.

With these additional assumptions, the optimal value
$\zeta$ of the maximization step in (19) is given by

$$\zeta = \mathbf{a}^T A_0 \mathbf{a} + 2\delta \sum_{i<j} |a_i||a_j|. \tag{23}$$

Thus, we can rewrite the objective of the minimax problem
as

$$\min_{\mathbf{a}} \; \mathbf{a}^T A_0 \mathbf{a} + 2\delta \sum_{i<j} |a_i||a_j|. \tag{24}$$

Unfortunately, the objective function of this problem is
not always convex, and there is no closed form solution.
To show the conditions for convexity, we rewrite the
objective function as

$$\mathbf{a}^T (A_0 - \delta I)\, \mathbf{a} + \delta \Big(\sum_{i} |a_i|\Big)^2. \tag{25}$$

It is easy to show that the second term, the "penalization
term" $\delta (\sum_i |a_i|)^2$, is a convex function. The
convexity of the first term depends on the value of
$\delta$. Since $A_0$ is a covariance matrix, which we take to be positive
definite, $\mathbf{a}^T (A_0 - \delta I)\, \mathbf{a}$ is convex if and only if
$\delta \leq \lambda_{\min}$, where $\lambda_{\min}$ is the smallest eigenvalue
of $A_0$.
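The equivalence between the worst case over the box (18), the form (24), and the rewritten form (25) can be checked numerically. The sketch below assumes $\delta_{ii} = 0$ and $\delta_{ij} = \delta$ for $i \neq j$, as in the text. The matrix `A0`, the weight vector, the value of `delta`, and the function names are hypothetical.

```python
def worst_case_value(A0, a, delta):
    """a^T A a with A chosen entrywise by (22): A_ij = A0_ij + delta_ij * sgn(a_i a_j)."""
    D = len(a)
    total = 0.0
    for i in range(D):
        for j in range(D):
            d = 0.0 if i == j else delta          # delta_ii = 0, delta_ij = delta
            prod = a[i] * a[j]
            sgn = (prod > 0) - (prod < 0)
            total += prod * (A0[i][j] + d * sgn)
    return total


def objective_24(A0, a, delta):
    """a^T A0 a + 2 * delta * sum_{i<j} |a_i| |a_j|."""
    D = len(a)
    quad = sum(a[i] * A0[i][j] * a[j] for i in range(D) for j in range(D))
    cross = sum(abs(a[i]) * abs(a[j]) for i in range(D) for j in range(i + 1, D))
    return quad + 2.0 * delta * cross


def objective_25(A0, a, delta):
    """a^T (A0 - delta * I) a + delta * (sum_i |a_i|)^2."""
    D = len(a)
    quad = sum(a[i] * (A0[i][j] - (delta if i == j else 0.0)) * a[j]
               for i in range(D) for j in range(D))
    return quad + delta * sum(abs(v) for v in a) ** 2


# Hypothetical estimate A0 and weights summing to one (illustrative only).
A0 = [[1.0, 0.2, 0.1],
      [0.2, 2.0, -0.3],
      [0.1, -0.3, 1.5]]
a = [0.7, 0.5, -0.2]
delta = 0.3
print(worst_case_value(A0, a, delta), objective_24(A0, a, delta),
      objective_25(A0, a, delta))
```

All three values coincide, and they exceed the nominal value $\mathbf{a}^T A_0 \mathbf{a}$ by $2\delta \sum_{i<j} |a_i||a_j|$.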

The second term serves as a penalization, restrict- 
ing the magnitudes of the coefficients. It is similar to 
Lasso Regression, except for the square. This term can 
be crudely interpreted as follows: when the covariance 
matrix is not accurately known, it is not wise to fully 
minimize the ensemble training residual without paying
attention to the complexity of the ensemble model
(measured by the squared $L_1$ norm of the weighting
coefficients).

Even if the problem is not convex, if the change in $A$
is not too large, the closed-form solution (10) is a fairly good initial
value, and gradient descent can be applied to solve the
problem specified by (24) and (6).



4.2 ICOA with Minimax Protection 

The above derivation actually changes the inner step
of our two-stage optimization, and we no longer have
a closed form solution. Nonetheless, we can still run
ICOA numerically, because we can still use perturbation
to estimate the influence of the change of $\mathbf{f}_i$ on the
value of (25), given that we can numerically solve the
inner minimization.

Obviously, if we know the covariance accurately,
changing the inner step from minimization to minimax,
i.e., changing (5) to (24), has no advantage. On the contrary,
it compromises the performance of the ensemble
estimator and slows down the convergence of
ICOA. However, if we add restrictions on the number
of data instances exchanged between two agents, this
makes $A_0$, the estimate of $A$, less accurate, and then
the minimax optimization is of utmost importance for
the convergence of ICOA. We call this procedure Minimax
Protection.

For instance, if we transmit only $1/\alpha$ of the total $N$
data instances (randomly sampled from all the data instances)
for covariance estimation, say, with $\alpha = 100$, then
the estimate $A_0$ has a large variance. Thus if we directly
substitute this estimate into the ICOA algorithm,
it causes inaccurate and unstable estimation of the direction
for gradient descent and prevents the algorithm
from converging. This phenomenon is illustrated in Figure 3,
where the compression rate is $\alpha = 100$ (only 1%
of the data are transmitted for each iteration) and $\delta = 0$
(no Minimax Protection).

Figure 3: ICOA without Minimax Protection for
Friedman-1. The training/test errors oscillate wildly
and fail to converge. There is no way to decide when
to stop the iterations.

However, if we foresee the inaccuracy and error of
the estimation of the covariance matrix, and properly
choose a value for $\delta$, then not only can we prevent the
oscillation of the training/test error, but we also sacrifice
little in the performance of the ensemble estimator.
In Figure 4, Minimax Protection is applied to ICOA,
with $\alpha = 100$ and $\delta = 0.8$.

Figure 4: ICOA with Minimax Protection for
Friedman-1. The training/test errors decrease almost
monotonically and converge rather quickly and
smoothly, with a reasonable compromise in performance.



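Table 2 shows the results of a series of simulations for
different values of the compression rate $\alpha$ and of
$\delta$. In this simulation, the data set is Friedman-1, and
the system configuration is the same as in the simulation
of the previous section. The individual estimator is
a 4th order polynomial.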



α             1        10       50       200      800
δ = 0.00      .0037    NaN      NaN      NaN      NaN
δ = .050      .0044    .0045    NaN      NaN      NaN
δ = .500      .0051    .0056    .0052    NaN      NaN
δ = .750      .0071    .0071    .0073    .0077    NaN
δ = 1.00      .0086    .0086    .0086    .0090    .0098
δ = 2.00      .0112    .0111    .0112    .0114    .0113

Table 2: Test errors (mean squared) of ICOA with Minimax
Protection for Friedman-1. For certain values of
$\alpha$ and $\delta$, ICOA does not converge, and the test error
exceeds machine limits and hence cannot be obtained.

It is worth pointing out two phenomena. First, when
ICOA with Minimax Protection converges, the performance
is almost independent of the compression rate
$\alpha$. Second, given $\alpha$, when the value of $\delta$ is above a
certain level, ICOA almost always converges, yet below
that level, ICOA does not converge. These two phenomena
allow us to find an optimal $\delta$ for every given
$\alpha$, so that we can optimize the performance of ICOA
under a given compression rate.

In Table 2, another dramatic phenomenon is the case
of $\alpha = 800$ and $\delta = 1.00$. Since we have only 4000
training data instances, this means that in each iteration
we use only 5 pairs of numbers to estimate the covariance
between two agents. Yet Minimax Protection
with a properly selected $\delta$ enables us to achieve a decent
test mean square error of .0098, only about 2.6 times
the optimal value .0037, while only 1% of the data transmission
is needed compared with the amount needed
in the optimal case (after taking into consideration the
longer convergence time). Thus ICOA provides us with
a very useful tool to trade off between performance and
data transmission.

4.3 Upper bound of the test error 

From the simulation results shown in Table 2, it is
of interest to investigate the relationship between the
compression rate $\alpha$ and the optimal performance (measured
by test error) of the system. As analyzed previously,
the key is to select a proper $\delta$ so that we neither
under-protect ICOA (leading to unstable convergence)
nor over-protect it (leading to worse performance). This
requires us to investigate the statistical properties of
the estimator of the correlation coefficient between two
random variables. In [1], it is shown that the pivot
statistic $T_N$ of the sample correlation coefficient has
the Student's t-distribution, where $N$ is the number of
data instances. Therefore the
95% confidence interval of the correlation coefficient is
given by $[\rho - \xi, \rho + \xi]$, where $\xi = 1.96(1 - \rho^2)/\sqrt{N}$.

If we assume that the largest variance of all the residuals
is $\sigma_{\max}^2$, then an approximation to the optimal $\delta$
(as a function of $\alpha$) can be given by

$$\delta_{\mathrm{opt}}(\alpha) = \min\Big\{ \frac{1.96\, \sigma_{\max}^2}{\sqrt{N/\alpha}},\; 2\sigma_{\max}^2 \Big\}. \tag{27}$$

The basic idea is to find the smallest $\delta$ that covers, with
high probability, the possible domain of the covariance
matrix, given a crude estimate $A_0$.
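The approximation to the optimal protection level is easy to tabulate. The sketch below evaluates $\min\{1.96\,\sigma_{\max}^2/\sqrt{N/\alpha},\; 2\sigma_{\max}^2\}$, which is our reading of (27). The function name `delta_opt`, the unit value of $\sigma_{\max}^2$, and the chosen compression rates are assumptions of the example; $N = 4000$ matches the simulation described earlier.

```python
import math


def delta_opt(alpha, n_instances, sigma2_max):
    """Approximate optimal delta for compression rate alpha, following (27) as read here."""
    samples = n_instances / alpha          # instances actually transmitted
    return min(1.96 * sigma2_max / math.sqrt(samples), 2.0 * sigma2_max)


# Hypothetical setting: N = 4000 instances, largest residual variance set to 1.
for alpha in (1, 10, 50, 200, 800):
    print(alpha, delta_opt(alpha, 4000, 1.0))
```

The protection level grows with the square root of the compression rate and is capped at $2\sigma_{\max}^2$, the full width of the range in which a covariance between two residuals can lie.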

With this approximation, we are able to develop an
upper bound on the test mean square error as a function
of $\alpha$. Define $A_{\mathrm{ini}}$ as the (accurate) covariance matrix of
the residuals of all individual estimators before we run
ICOA. For each step, ICOA with Minimax Protection
improves the test error (not merely the training error), because
Minimax Protection guarantees, with high probability,
that the true covariance matrix is in the range
$\mathcal{C}$ defined in (18). Therefore, the solution to

$$\min_{\mathbf{a}} \; \mathbf{a}^T (A_{\mathrm{ini}} - \delta_{\mathrm{opt}}(\alpha) I)\, \mathbf{a} + \delta_{\mathrm{opt}}(\alpha) \Big(\sum_{i} |a_i|\Big)^2 \tag{28}$$

with constraint (6) provides us with an upper bound
(with high probability) on the generalization error with
respect to the compression rate $\alpha$. Figure 5 illustrates
the comparison between this upper bound and the simulated
optimal performance.

Figure 5: Comparison between the upper bound given
by (28) and the simulated test error.



5 Conclusions 

In this paper, we have shown that ICOA, as a coop- 
erative training algorithm, demonstrates its efficacy for 
finding an optimal ensemble estimator of additive form, 
while demonstrating an insusceptibility to overtraining. 
Moreover, Minimax Protection provides us with a tool
to run ICOA when covariances are not accurately estimated,
and hence enables us to trade off between performance
and data transmission. Minimax Protection,
combined with ICOA, also helps us to develop an upper 
bound on the test error for the ensemble estimator. 